Design Requirements
1. Thinking in System Design
In system design interviews, you aren't expected to know the nitty-gritty details of every single component; what matters more is the thought process and analysis behind creating an effective distributed system. Designing a large enterprise system boils down to three concerns: moving data, storing data, and transforming data.
1. Moving Data
When designing large systems, our focus shifts to moving data between clients and across a network of servers that may be geographically dispersed around the world. This is significantly more challenging than moving data locally within a single machine.
2. Storing Data
Choosing the right data storage solution depends on the specific needs of your system, similar to selecting data structures for algorithms. Distributed systems often deal with massive amounts of data that wouldn't fit on a single server's disk. The chosen storage solution should be efficient in terms of data access, retrieval, and modification.
3. Transforming Data
Transforming data means taking existing data and producing a new output from it.
- Example 1: Given a set of server logs, we output the percentage of successful requests versus the percentage of failed requests. This is sometimes handled by a monitoring service.
- Example 2: Given some medical records, we filter the patients by age.
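Both examples above can be sketched in a few lines of Python. This is a minimal illustration, not a production pipeline: the function names, the `Patient` record, and the rule that a status code below 400 counts as a success are all assumptions made for the sketch.

```python
from dataclasses import dataclass


def success_failure_percentages(status_codes):
    """Return (success %, failure %) for a list of HTTP status codes.

    Assumption for this sketch: any status code below 400 is a success.
    """
    if not status_codes:
        return 0.0, 0.0
    successes = sum(1 for code in status_codes if code < 400)
    success_pct = 100.0 * successes / len(status_codes)
    return success_pct, 100.0 - success_pct


@dataclass
class Patient:
    # Hypothetical record shape; real medical records hold far more fields.
    name: str
    age: int


def filter_by_age(patients, min_age, max_age):
    """Keep only the patients whose age lies in [min_age, max_age]."""
    return [p for p in patients if min_age <= p.age <= max_age]


print(success_failure_percentages([200, 200, 404, 500]))  # (50.0, 50.0)
```

At larger scale the same transformations would typically run as batch or streaming jobs across many machines, but the logic per record stays this simple.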
Just like with data storage, choosing the best data transformation approach depends on the specific needs of your system. You want to achieve the desired outcome while optimizing for efficiency.
2. Good System Design
Bad design choices in the application architecture can be very costly and very difficult to correct later.
- For example, choosing the wrong database can mean migrating data from one database to another while simultaneously rewriting portions of the application.
Several factors determine what constitutes a good design.
1. Availability
Availability refers to the percentage of time the system is available: availability = uptime / (uptime + downtime).
- Uptime - the total amount of time the system is up.
- Downtime - the total amount of time the system is unavailable to users.
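The availability formula above translates directly into code. The sketch below assumes uptime and downtime are measured in hours; the function name and the sample numbers are illustrative only.

```python
def availability(uptime_hours, downtime_hours):
    """availability = uptime / (uptime + downtime), returned as a fraction."""
    total = uptime_hours + downtime_hours
    if total <= 0:
        raise ValueError("total observed time must be positive")
    return uptime_hours / total


# A 30-day month (720 hours) with 3 hours of downtime.
print(f"{availability(717, 3):.4%}")  # 99.5833%
```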
- Nowadays, the systems we design need global connectivity and near-24/7 operation, because requests to access the system can arrive concurrently from different time zones.
Downtime can be planned (scheduled software update) or unplanned (hardware/software failure).
Ideally, a system would have 100% availability, but this is not achievable in practice because of unplanned downtime. Companies therefore typically aim for at least 99% availability.
- However, even 99% uptime means that out of 365 days, the system would be down for 3.65 days.
- Bringing uptime to 99.9% (0.1% downtime, about 8.76 hours per year) is a big jump, because it cuts downtime by a factor of 10. This is why availability is measured in terms of "nines".
- A good target for companies is 99.999% availability ("five nines"), which allows only about 5 minutes (roughly 5.26) of downtime in 365 days. This can be hard to achieve but is important for mission-critical systems.
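The downtime figures above follow from simple arithmetic on a 365-day year. The helper below is a small sketch (its name is an assumption) that prints the yearly downtime budget for each level of "nines".

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes in a 365-day year


def yearly_downtime_minutes(availability_pct):
    """Minutes of downtime per year allowed at a given availability percentage."""
    return MINUTES_PER_YEAR * (1 - availability_pct / 100)


for pct in (99.0, 99.9, 99.99, 99.999):
    minutes = yearly_downtime_minutes(pct)
    print(f"{pct}% availability allows {minutes:.1f} minutes of downtime per year")
```

Each extra nine divides the allowed downtime by 10, which is why moving from 99% (5,256 minutes, or 3.65 days) to 99.999% (about 5.26 minutes) is so demanding.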
SLOs and SLAs
The measure of availability is used to define SLOs (service level objectives) and SLAs (service level agreements).
- An SLO defines a target for a specific aspect of your service. For example, a monthly SLO for an AWS database service could be for the database to be available to users 99.999% of the time.
- An SLA is an agreement a company makes with its clients or users to provide a certain level of uptime and responsiveness, and it spells out responsibilities if that level is not met. For example, if AWS does not meet the availability committed in its SLA, it refunds a percentage of the bill as service credits.
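To make the distinction concrete, the sketch below checks a measured availability against an SLO and looks up the credit owed under an SLA. The target and the credit tiers are made-up numbers chosen purely for illustration; they are not AWS's actual terms, and the function names are hypothetical.

```python
# Hypothetical internal target (SLO), in percent.
SLO_TARGET_PCT = 99.9

# Hypothetical SLA credit tiers: (availability below this %, credit as % of bill).
# Ordered from the worst outage to the mildest so the first match wins.
CREDIT_TIERS = [(95.0, 100), (99.0, 25), (99.9, 10)]


def meets_slo(measured_pct, target_pct=SLO_TARGET_PCT):
    """True if the measured availability reaches the objective."""
    return measured_pct >= target_pct


def service_credit_pct(measured_pct):
    """Credit owed to the customer under the hypothetical tiers above."""
    for threshold, credit in CREDIT_TIERS:
        if measured_pct < threshold:
            return credit
    return 0


print(meets_slo(99.5), service_credit_pct(99.5))  # False 10
```

The SLO is the number the team aims for; the SLA is what attaches a consequence, here a credit, to missing the committed level.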
2. Reliability, Fault Tolerance, and Redundancy
- Reliability refers to the system’s ability to perform its intended function without failure or errors over a specified period of time.
- Fault tolerance refers to how well the system can detect and recover from a problem, for example by disabling a function, reverting to a different mode, or switching to a different server.
In the context of server operations, reliability is the likelihood of the server operating without failure. During periods of heavy traffic or DDoS attacks, the server's fault tolerance is assessed by its ability to remain operational despite such challenges.
- Redundancy is one mechanism for achieving fault tolerance. For example, we can keep a backup (redundant) server that only comes into play if the primary server fails. This arrangement is called active-passive redundancy.
→ Having this backup gives us fault tolerance.
- Having two servers that are both active at the same time is called active-active redundancy.
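The backup-server arrangement described above (a standby that is used only when the primary fails) can be sketched as follows. The `Server` class and `handle_with_failover` function are hypothetical names for this illustration; a real deployment would rely on health checks and a load balancer rather than an in-process try/except.

```python
class Server:
    """Toy server that either handles a request or raises if it is down."""

    def __init__(self, name, healthy=True):
        self.name = name
        self.healthy = healthy

    def handle(self, request):
        if not self.healthy:
            raise ConnectionError(f"{self.name} is down")
        return f"{self.name} handled {request}"


def handle_with_failover(request, primary, backup):
    """Try the primary first; fall back to the backup only if it fails."""
    try:
        return primary.handle(request)
    except ConnectionError:
        return backup.handle(request)


primary = Server("primary", healthy=False)
backup = Server("backup")
print(handle_with_failover("GET /", primary, backup))  # backup handled GET /
```

In an active-active setup, both servers would receive traffic all the time, and losing one would simply shift its share of requests to the other.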
Figure: A normal request with a successful response.
Figure: When there is a DDoS attack, only the intended users get the response back.
3. Throughput
Throughput refers to the amount of data/operations we can handle over some period of time.
- The throughput of a server handling client requests is measured in requests per second.
- For a database, throughput can be measured in queries per second: the number of queries the database handles each second.
- Throughput can also be measured in bytes per second: the amount of data actually transferred over a network per unit of time. (The maximum possible rate is the network's bandwidth.)
→ To improve throughput, we can scale vertically or horizontally.
Vertical scaling adds resources (such as CPU or RAM) to a single server so that it can handle more requests.
Horizontal scaling adds more servers so that requests are spread across multiple machines.
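As a rough illustration, throughput can be measured by counting completed operations over elapsed time. The sketch below uses hypothetical function names, and the horizontal-scaling estimate is deliberately idealized: it assumes requests are spread perfectly evenly and ignores coordination overhead.

```python
import time


def measure_throughput(handler, requests):
    """Run every request through handler and return operations per second."""
    start = time.perf_counter()
    for request in requests:
        handler(request)
    elapsed = time.perf_counter() - start
    return len(requests) / elapsed if elapsed > 0 else float("inf")


def horizontal_capacity(per_server_rps, server_count):
    """Idealized combined capacity of identical servers (upper bound)."""
    return per_server_rps * server_count


rps = measure_throughput(lambda r: sum(range(1000)), list(range(10_000)))
print(f"single handler: {rps:.0f} requests/second")
print(horizontal_capacity(500, 4))  # 2000
```

Vertical scaling would show up here as a higher `per_server_rps`, while horizontal scaling raises `server_count`.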
4. Latency
Latency refers to the delay between the client making a request and the server responding to it, that is, the time it takes for each individual request to be completed.
- Latency is not exclusive to networks; it also exists within a computer's internal components (for example, when the CPU accesses data from cache or RAM).
Figure: Round-trip latency for a user making a request.
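Round-trip latency can be measured by timing a single request from the moment it is sent until the response arrives. In the sketch below, `fake_request` is a stand-in that sleeps for 50 ms to simulate network and server time; both function names are assumptions for illustration.

```python
import time


def measure_latency_ms(send_request):
    """Time one call to send_request and return (response, latency in ms)."""
    start = time.perf_counter()
    response = send_request()
    latency_ms = (time.perf_counter() - start) * 1000.0
    return response, latency_ms


def fake_request():
    # Stand-in for a real network call: pretend the round trip takes ~50 ms.
    time.sleep(0.05)
    return "ok"


response, latency_ms = measure_latency_ms(fake_request)
print(f"response={response!r}, latency={latency_ms:.1f} ms")
```

Note the contrast with throughput: latency describes how long one request takes, while throughput describes how many requests complete per unit of time.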
Distributed systems improve availability, reliability, throughput, and latency for users globally by placing servers in various locations around the world. Users experience faster response times (lower latency) because requests can be served from a nearby location, and they keep access to services (higher availability) even if one server location encounters problems. Additionally, distributed systems can handle heavier loads (increased throughput) by distributing processing tasks across multiple servers.